35 research outputs found

    Towards an Environment for doing Data Science that runs in Browsers

    This article proposes a path for doing Data Science using browsers as computing and data nodes. This novel idea is motivated by the cross-fertilization of several fields: desktop grid computing, data management in grids and clouds, Web technologies such as NoSQL tools, models of interaction, and programming models in grids, clouds and Web technologies. We propose a methodology for the modeling, analysis, implementation and simulation of a prototype able to run a MapReduce job in browsers. This work helps to better understand the big picture of Data Science when JavaScript is the language for programming the middleware and the interactions between components, and browsers act as the operating system. We explain what types of applications may be impacted by this novel approach and, from a general point of view, how a formal modeling of the interactions serves as a general guideline for the implementation. Formal modeling in our methodology is a necessary condition but it is not sufficient. We also make round-trips between the modeling and the JavaScript code or tools used, either to enrich the interaction model, which is the key point, or to add detail to the implementation. To the best of our knowledge, this is the first time that Data Science operates in the context of browsers that exchange code and data to solve computation- and data-intensive programs. The terms computation-intensive and data-intensive should be understood relative to the applications that we think are suitable for our system.

    BitDew: A Programmable Environment for Large-Scale Data Management and Distribution

    Desktop Grids use the computing, network and storage resources of idle desktop PCs distributed over multiple LANs or the Internet to compute a large variety of resource-demanding distributed applications. While these applications need to access, compute, store and circulate large volumes of data, little attention has been paid to data management in such large-scale, dynamic, heterogeneous, volatile and highly distributed Grids. In most cases, data management relies on ad-hoc solutions, and providing a general approach is still a challenging issue. To address this problem, we propose the BitDew framework, a programmable environment for automatic and transparent data management on computational Desktop Grids. This paper describes the BitDew programming interface, its architecture, and the performance evaluation of its runtime components. BitDew relies on a specific set of metadata to drive key data management operations, namely life cycle, distribution, placement, replication and fault tolerance, with a high level of abstraction. The BitDew runtime environment is a flexible distributed service architecture that integrates modular P2P components such as DHTs for a distributed data catalog and collaborative transport protocols for data distribution. Through several examples, we describe how application programmers and BitDew users can exploit BitDew's features. The performance evaluation demonstrates that the high level of abstraction and transparency is obtained with a reasonable overhead, while offering the benefit of scalability, performance and fault tolerance at little programming cost.

    Availability and Network-Aware MapReduce Task Scheduling over the Internet

    MapReduce offers an easy-to-use programming paradigm for processing large datasets. In our previous work, we designed a MapReduce framework called BitDew-MapReduce for desktop grid and volunteer computing environments, which allows non-expert users to run data-intensive MapReduce jobs on top of volunteer resources over the Internet. However, network distance and resource availability have a great impact on MapReduce applications running over the Internet. To address this, we propose an availability- and network-aware MapReduce framework over the Internet. Simulation results show that the MapReduce job response time can be decreased by 27.15%, thanks to availability prediction based on a Naive Bayes classifier and landmark-based network estimation.

    D³-MapReduce: Towards MapReduce for Distributed and Dynamic Data Sets

    Since its introduction in 2004 by Google, MapReduce has become the programming model of choice for processing large data sets. Although MapReduce was originally developed for use by web enterprises in large data centers, this technique has gained a lot of attention from the scientific community for its applicability to large parallel data analysis (including geography, high energy physics, genomics, etc.). So far MapReduce has been mostly designed for batch processing of bulk data. The ambition of D³-MapReduce is to extend the MapReduce programming model and propose an efficient implementation of this model to: i) cope with distributed data sets, i.e. data sets that span multiple distributed infrastructures or are stored on networks of loosely connected devices; ii) cope with dynamic data sets, i.e. data sets that change over time or can be either incomplete or partially available. In this paper, we draw the path towards this ambitious goal. Our approach leverages the Data Life Cycle as a key concept to provide MapReduce for distributed and dynamic data sets on heterogeneous and distributed infrastructures. We first report on our attempts at implementing the MapReduce programming model for Hybrid Distributed Computing Infrastructures (Hybrid DCIs). We present the architecture of the prototype based on BitDew, a middleware for large-scale data management, and Active Data, a programming model for data life cycle management. Second, we outline the challenges in terms of methodology and present our approaches based on simulation and emulation on the Grid'5000 experimental testbed. We conduct performance evaluations and compare our prototype with Hadoop, the industry reference MapReduce implementation. We present our work in progress on dynamic data sets, which has led us to implement an incremental MapReduce framework. Finally, we discuss our achievements and outline the challenges that remain to be addressed before obtaining a complete D³-MapReduce environment.

    Advanced Analyses of the Asynchronous Parallel and Distributed Hybrid GMRES/LS-Arnoldi Method for Computing Grids and Supercomputers

    Many scientific and industrial problems require solving large-scale nonsymmetric linear systems described by very large sparse matrices. Iterative numerical methods are frequently used for this, together with parallelism, to obtain a fast and efficient solution. The GMRES(m) algorithm is an iterative method that gives good results in most cases. Its parallelization is nevertheless limited by the many communications it generates, and in some cases convergence is reached very slowly or even never. In this thesis we present a hybrid GMRES(m)/LS-Arnoldi method that accelerates convergence using eigenvalues computed in parallel by the Arnoldi method for the real case, together with its implementation on supercomputers. An extension to the complex case is also studied. The latest trend in global computing, grid computing, proposes the massive exploitation of idle resources on local area networks and on the Internet, which can be of great benefit for running parallel applications. The XtremWeb environment is a lightweight, secure, fault-tolerant grid system for running parallel applications. It is a high-performance computing environment and a software grid platform for experimentation by academic or industrial organizations. We present the implementation of the GMRES(m) method on the XtremWeb grid system as well as on the LAM-MPI distributed computing environment. We ran numerous tests on the grid and on supercomputers. From the performance results obtained, we note the advantages and disadvantages of these different computing platforms.

    Implementing Service Discovery for Large-Scale Dynamic Platforms

    Service discovery proves to be a critical feature of large-scale dynamic platforms. Within the Spades petascale platform (Servicing Petascale Architecture and DistributEd System), we chose to decentralize this feature and base it on our own prefix-tree data structure: the Dlpt (Distributed Lexicographic Placement Table). In this article, we propose an implementation of these concepts: the Sbam middleware (Spades BAsed Middleware). We then conduct a series of experiments to evaluate how the proposed peer-to-peer implementation behaves under load. We observe the read access time to the structure
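
The abstract above bases service discovery on a prefix tree. As a rough illustration of that idea only, here is a minimal single-process sketch in Python. It is not the Dlpt or the Sbam middleware: the class names `PrefixTreeNode` and `ServiceIndex`, the methods `register` and `discover`, and the service and node names are all hypothetical, and nothing here is distributed.

```python
from typing import Dict, List


class PrefixTreeNode:
    """One node of the tree; each edge to a child is labeled with one character."""

    def __init__(self) -> None:
        self.children: Dict[str, "PrefixTreeNode"] = {}
        self.providers: List[str] = []  # hosts offering the service that ends here


class ServiceIndex:
    """Hypothetical in-memory index mapping service names to providers."""

    def __init__(self) -> None:
        self.root = PrefixTreeNode()

    def register(self, name: str, provider: str) -> None:
        # Walk down the tree, creating nodes as needed, one character at a time.
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, PrefixTreeNode())
        node.providers.append(provider)

    def discover(self, prefix: str) -> Dict[str, List[str]]:
        # Locate the node that corresponds to the prefix, if any.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return {}
            node = node.children[ch]
        # Collect every registered service in the subtree below that node.
        found: Dict[str, List[str]] = {}
        stack = [(prefix, node)]
        while stack:
            name, current = stack.pop()
            if current.providers:
                found[name] = list(current.providers)
            for ch, child in current.children.items():
                stack.append((name + ch, child))
        return found


index = ServiceIndex()
index.register("dgemm", "node-a")
index.register("dgemv", "node-b")
index.register("fft", "node-c")
print(index.discover("dge"))
```

A lookup costs a number of steps proportional to the length of the prefix plus the size of the matching subtree, which is one reason a prefix tree suits name-based discovery.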

    Shortest Processing Time First and Hadoop

    Big data has proven to be a powerful tool for many sectors, ranging from science to business. Distributed data-parallel computing is therefore common nowadays: using a large number of computing and storage resources makes data processing possible at a previously unseen scale. But to develop large-scale distributed big data processing, one has to tackle many challenges. One of the most complex is scheduling. Because it is known to be an optimal online scheduling policy for minimizing the average flowtime, Shortest Processing Time First (SPT) is a classic scheduling policy used in many systems. We therefore decided to integrate this policy into Hadoop, a framework for big data processing, and to build a prototype implementation. This paper describes this integration, as well as the test results obtained on our testbed.
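
To make the SPT policy mentioned above concrete, here is a minimal Python sketch. It is not the paper's Hadoop integration and uses no Hadoop API. It assumes a single machine, job durations known in advance, and all jobs available at time zero. The function names `average_flowtime` and `spt_order` and the job durations are hypothetical.

```python
from typing import List


def average_flowtime(durations: List[float]) -> float:
    """Average completion time when jobs run back to back, in list order, on one machine."""
    clock = 0.0
    total = 0.0
    for d in durations:
        clock += d      # this job finishes at the current clock value
        total += clock  # its flowtime equals its completion time (release at t = 0)
    return total / len(durations)


def spt_order(durations: List[float]) -> List[float]:
    """Shortest Processing Time First: run the shortest jobs first."""
    return sorted(durations)


jobs = [8.0, 1.0, 3.0]                 # hypothetical durations, in arrival order
print(average_flowtime(jobs))          # arrival order: completions at 8, 9, 12
print(average_flowtime(spt_order(jobs)))  # SPT order: completions at 1, 4, 12
```

Running short jobs first means fewer jobs wait behind a long one, which is why the average drops from 29/3 to 17/3 in this small example.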